A Context-theoretic Framework for 
Compositionality in Distributional Semantics 



Daoud Clarke* 

University of Hertfordshire 



1. Introduction 



In recent years, the abundance of text corpora and computing power has allowed
the development of techniques to analyse statistical properties of words. For example,
techniques such as latent semantic analysis (Deerwester et al. 1990) and its variants,
and measures of distributional similarity (Lin 1998; Lee 1999) attempt to derive
aspects of the meanings of words by statistical analysis, while statistical information
is often used when parsing to determine sentence structure (Collins 1997). These
techniques have proved useful in many applications within computational linguistics
and natural language processing (Schütze 1998; McCarthy et al. 2004; Grefenstette 1994;
Lin 2003; Bellegarda 2000; Choi, Wiemer-Hastings, and Moore 2001), arguably providing
evidence that they capture something about the nature of words that should be
included in representations of their meaning. However, it is very difficult to reconcile 
these techniques with existing theories of meaning in language, which revolve around 
logical and ontological representations. The new techniques, almost without exception, 
can be viewed as dealing with vector-based representations of meaning, placing mean- 
ing (at least at the word level) within the realm of mathematics and algebra; conversely 
the older theories of meaning dwell in the realm of logic and ontology. It seems there 
is no unifying theory of meaning to provide guidance to those making use of the new 
techniques. 

The problem appears to be a fundamental one in computational linguistics since 
the whole foundation of meaning seems to be in question. The older, logical theories 
often subscribe to a model-theoretic philosophy of meaning (Kamp and Reyle 1993;
Blackburn and Bos 2005). According to this approach, sentences should be translated
to a logical form that can be interpreted as a description of the state of the world.
The new vector-based techniques, on the other hand, are often closer in spirit to the
philosophy of "meaning as context", that the meaning of an expression is determined by
how it is used. This is an old idea with origins in the philosophy of Wittgenstein (1953),
who said that "meaning just is use", and Firth (1957), "You shall know a word by the
company it keeps", and the distributional hypothesis of Harris (1968), that words will
occur in similar contexts if and only if they have similar meanings. This hypothesis
is justified by the success of techniques such as latent semantic analysis as well as
experimental evidence (Miller and Charles 1991). Whilst the two philosophies are not
obviously incompatible, especially since the former applies mainly at the sentence
level and the latter mainly at the word level, it is not clear how they relate to each
other.

Figure 1

Method of Approach in developing the Context-theoretic Framework. 



The problem of how to compose vector representations of meanings of words has
recently received increased attention (Widdows 2008; Clark, Coecke, and Sadrzadeh 2008;
Mitchell and Lapata 2008; Erk and Padó 2009; Preller and Sadrzadeh 2009; Guevara 2011;
Baroni and Zamparelli 2010), although the problem has been considered in earlier work
(Smolensky 1990; Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998;
Kintsch 2001). A solution to this problem would have practical as well as philosophical
benefits. Current
techniques such as latent semantic analysis work well at the word level, but we cannot 
extend them much beyond this, to the phrase or sentence level, without quickly 
encountering the data-sparseness problem: there are not enough occurrences of strings 
of words to determine what their vectors should be merely by looking in corpora. If we 
knew how such vectors should compose then we would be able to extend the benefits 
of the vector based techniques to the many applications that require reasoning about 
the meaning of phrases and sentences. 

This paper describes the results of our own efforts to identify a theory that can 
unite these two paradigms, and includes a summary of work described in the author's 
DPhil thesis (Clarke 2007). In addition, we also discuss the relationship between this
theory and methods of composition that have recently been proposed in the literature, 
showing that many of them can be considered as falling within our framework. 

Our approach in identifying the framework is summarised in Figure 1:

• Inspired by the philosophy of meaning as context and vector-based
techniques we developed a mathematical model of meaning as context, in
which the meaning of a string is a vector representing contexts in which
that string occurs in a hypothetical infinite corpus.

• The theory on its own is not useful when applied to real world corpora
because of the problem of data sparseness. Instead we examine the
mathematical properties of the model, and abstract them to form a
framework which contains many of the properties of the model.
Implementations of the framework are called context theories since they 
can be viewed as theories about the contexts in which strings occur. By 
analogy with the term "model-theoretic" we use the term 
"context-theoretic" for concepts relating to context theories, thus we call 
our framework the context-theoretic framework. 

• In order to ensure that the framework was practically useful, context 
theories were developed in parallel with the framework itself. The aim 
was to be able to describe existing approaches to representing meaning 
within the framework as fully as possible. 

In developing the framework we were looking for specific properties; namely, we 
wanted it to: 

• provide some guidelines describing in what way the representation of a 
phrase or sentence should relate to the representations of the individual 
words as vectors; 

• require information about the probability of a string of words to be 
incorporated into the representation; 

• provide a way to measure the degree of entailment between strings based 
on the particular meaning representation; 

• be general enough to encompass logical representations of meaning; 

• be able to incorporate the representation of ambiguity and uncertainty, 
including statistical information such as the probability of a parse or the 
probability that a word takes a particular sense. 

The framework we present is abstract, and hence does not subscribe to a particular 
method for obtaining word vectors: they may be raw frequency counts, or vectors obtained
by a method such as latent semantic analysis. Nor does the framework provide a
recipe for how to represent meaning in natural language; instead it provides restrictions
on the set of possibilities. The advantage of the framework is in ensuring that techniques
are used in a way that is well-founded in a theory of meaning. For example, given 
vector representations of words, there is not one single way of combining these to give 
vector representations of phrases and sentences, but in order to fit within the framework 
there are certain properties of the representation that need to hold. Any method of 
combining these vectors in which these properties hold can be considered within the 
framework and is thus justified according to the underlying theory; in addition the 
framework instructs us as to how to measure the degree of entailment between strings 
according to that particular method. We will attempt to show the broad applicability of 
the framework by applying it to problems in natural language processing. 
The contribution of this paper is as follows: 

• We define the context-theoretic framework and introduce the mathematics
necessary to understand it. The description presented here is cleaner than
that of Clarke (2007), and in addition we provide examples which should
provide intuition for the concepts we describe.

• We relate the framework to methods of composition that have been
proposed in the literature, namely:

  - vector addition (Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998);

  - the tensor product (Smolensky 1990; Clark and Pulman 2007; Widdows 2008);

  - the multiplicative models of Mitchell and Lapata (2008);

  - matrix multiplication (Rudolph and Giesbrecht 2010; Baroni and Zamparelli 2010);

  - the approach of Clark, Coecke, and Sadrzadeh (2008).



2. Context Theory 



In this section, we define the fundamental concept of our concern, a context theory and 
discuss its properties. 



Definition 1 (Context Theory) 

A context theory is a tuple $\langle A, \mathcal{A}, \xi, V, \psi \rangle$, where $A$ is a set (the alphabet), $\mathcal{A}$ is a unital
algebra over the real numbers, $\xi$ is a function from $A$ to $\mathcal{A}$, $V$ is an abstract Lebesgue
space and $\psi$ is an injective linear map from $\mathcal{A}$ to $V$.



We will explain each part of this definition, introducing the necessary mathematics 
as we proceed. We assume the reader is familiar with linear algebra; see (Halmos 1974) 
for definitions that are not included here. 



2.1 Algebra over a field 

We have identified an algebra over a field as an important construction since it gener- 
alises nearly all the methods of vector-based composition that have been proposed. 

Definition 2 (Algebra over a field) 

An algebra over a field (or simply algebra when there is no ambiguity) is a vector space 
$\mathcal{A}$ over a field $K$ together with a binary operation $(a, b) \mapsto ab$ on $\mathcal{A}$ that is bilinear, i.e.

$a(\alpha b + \beta c) = \alpha ab + \beta ac$
$(\alpha a + \beta b)c = \alpha ac + \beta bc$

and associative, i.e. $(ab)c = a(bc)$, for all $a, b, c \in \mathcal{A}$ and all $\alpha, \beta \in K$.¹ An algebra is called
unital if it has a distinguished unity element 1 satisfying $1x = x1 = x$ for all $x \in \mathcal{A}$.



1 Some authors do not place the requirement that an algebra is associative, in which case our definition 
would refer to an associative algebra. 








          d_1   d_2   d_3
cat        0     2     3
animal     2     1     2
big        1     3     0

Table 1

Example of possible occurrences for three terms in three different contexts.



We are generally only interested in real algebras, i.e. the situation where $K$ is the field
of real numbers, $\mathbb{R}$.

Example 1 

The square real-valued matrices of order n form a real unital associative algebra under 
standard matrix multiplication. The vector operations are defined entry-wise. The unity
element of the algebra is the identity matrix. 

This means that our proposal is more general than that of 
Rudolph and Giesbrecht (2010), who suggest using matrix multiplication as a
framework for distributional semantic composition. The main differences in our 
proposal are: 

• We allow dimensionality to be infinite, instead of restricting ourselves to 
finite-dimensional matrices; 

• Matrix algebras form a *-algebra, whereas we do not currently place this
requirement; 

• We emphasise the order structure that is inherent in real vector spaces 
when there is a distinguished basis. 

The purpose of $\xi$ in the context theory is to associate elements of the algebra with
strings of words. Considering only the multiplication of $\mathcal{A}$ (and ignoring the vector
operations), $\mathcal{A}$ is a monoid, since we assumed that the multiplication on $\mathcal{A}$ is associative.
Then $\xi$ induces a monoid homomorphism $a \mapsto \hat{a}$ from $A^*$ to $\mathcal{A}$. We denote the mapped
value of $a \in A^*$ by $\hat{a} \in \mathcal{A}$, which is defined as follows:

$\hat{a} = \xi(a_1)\xi(a_2) \cdots \xi(a_n)$

where $a = a_1 a_2 \cdots a_n$ for $a_i \in A$, and we define $\hat{\epsilon} = 1$, where $\epsilon$ is the empty string. Thus,
the mapping defined by $\hat{\ }$ allows us to associate an element of the algebra with every
string of words.
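
As an informal illustration of the induced homomorphism, the following sketch folds the product of the algebra over the symbols of a string. It assumes the algebra of 2 by 2 matrices from Example 1; the names `hat`, `matmul`, `XI` and the matrix values assigned to the symbols are hypothetical and chosen only for illustration.

```python
from functools import reduce

def matmul(p, q):
    """Product of two square matrices given as tuples of row tuples."""
    size = len(p)
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size))
        for i in range(size)
    )

def hat(string, xi, multiply, unit):
    """Map a sequence of symbols to xi(a_1) xi(a_2) ... xi(a_n).

    The empty sequence maps to the unity element of the algebra.
    """
    return reduce(multiply, (xi[symbol] for symbol in string), unit)

# Hypothetical assignment of algebra elements to two symbols.
IDENTITY = ((1, 0), (0, 1))
XI = {"a": ((1, 1), (0, 1)), "b": ((1, 0), (1, 1))}

ab = hat(["a", "b"], XI, matmul, IDENTITY)
ba = hat(["b", "a"], XI, matmul, IDENTITY)
print(ab)  # ((2, 1), (1, 1))
print(ba)  # ((1, 1), (1, 2))
```

Because matrix multiplication is not commutative, the two orderings of the same symbols receive different elements of the algebra, while the empty string is sent to the identity matrix.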

The algebra is what tells us how meanings compose. A crucial part of our thesis is 
that meanings can be represented by elements of an algebra, and that the type of compo- 
sition that can be defined using an algebra is general enough to describe the composition 
of meaning in natural language. To go some way towards justifying this, we give several 
examples of algebras that describe methods of composition that have been proposed in 
the literature: namely point-wise multiplication (Mitchell and Lapata 2008), vector
addition (Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998) and the tensor
product (Smolensky 1990; Clark and Pulman 2007; Widdows 2008).

Example 2 (Point-wise multiplication)

Consider the n-dimensional real vector space $\mathbb{R}^n$. We describe a vector $u \in \mathbb{R}^n$ in terms
of its components as $(u_1, u_2, \ldots, u_n)$ with each $u_i \in \mathbb{R}$. We can define a multiplication $\cdot$
on this space by

$(u_1, u_2, \ldots, u_n) \cdot (v_1, v_2, \ldots, v_n) = (u_1 v_1, u_2 v_2, \ldots, u_n v_n)$

It is easy to see that this satisfies the requirements for an algebra specified above. Table
1 shows a simple example of possible occurrences for three terms in three different
contexts, $d_1$, $d_2$ and $d_3$, which may, for example, represent documents. We use this to
define the mapping $\xi$ from terms to vectors. Thus, in this example, we have $\xi(cat) = (0, 2, 3)$
and $\xi(big) = (1, 3, 0)$. Under point-wise multiplication, we would have

$\widehat{big\ cat} = \xi(big) \cdot \xi(cat) = (1, 3, 0) \cdot (0, 2, 3) = (0, 6, 0).$
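
The calculation in Example 2 can be reproduced with a few lines of code. This is a minimal sketch: the dictionary `XI` simply transcribes Table 1, and the function name `pointwise` is our own choice for illustration.

```python
# Context vectors transcribed from Table 1 (components d_1, d_2, d_3).
XI = {"cat": (0, 2, 3), "animal": (2, 1, 2), "big": (1, 3, 0)}

def pointwise(u, v):
    """Point-wise product of two vectors of equal dimension."""
    return tuple(a * b for a, b in zip(u, v))

UNITY = (1, 1, 1)  # unity element of this algebra in three dimensions

big_cat = pointwise(XI["big"], XI["cat"])
print(big_cat)  # (0, 6, 0)
```

Swapping the arguments gives the same vector, which is the commutativity that makes this product unattractive as a model of word order.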

One commonly used operation for composing vector-based representations of
meaning is vector addition. As noted by Rudolph and Giesbrecht (2010), this can be
described using matrix multiplication, by embedding an n-dimensional vector u into
a matrix of order n + 1:

$$
\begin{pmatrix}
\alpha & u_1 & u_2 & \cdots & u_n \\
0 & \alpha & 0 & \cdots & 0 \\
0 & 0 & \alpha & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \alpha
\end{pmatrix}
$$

where $\alpha = 1$. The set of all such matrices, for all real values of $\alpha$, forms a subalgebra of
the algebra of matrices of order n + 1. A subalgebra of an algebra $\mathcal{A}$ is a sub-vector space
of $\mathcal{A}$ which is closed under the multiplication of $\mathcal{A}$. This subalgebra can be equivalently
described as follows:

Example 3 (Additive algebra)

For two vectors $u = (\alpha, u_1, u_2, \ldots, u_n)$ and $v = (\beta, v_1, v_2, \ldots, v_n)$ in $\mathbb{R}^{n+1}$, we define the
additive product $\boxplus$ by

$u \boxplus v = (\alpha\beta, \alpha v_1 + \beta u_1, \alpha v_2 + \beta u_2, \ldots, \alpha v_n + \beta u_n)$

To verify that this multiplication makes $\mathbb{R}^{n+1}$ an algebra, we can directly verify the
bilinear and associativity requirements, or check that it is isomorphic to the subalgebra
of matrices discussed above.

Using the table from the previous example, we define $\xi_+$ so that it maps n-dimensional
context vectors to $\mathbb{R}^{n+1}$, where the first component is 1, so $\xi_+(big) = (1, 1, 3, 0)$
and $\xi_+(cat) = (1, 0, 2, 3)$ and

$\widehat{big\ cat} = \xi_+(big) \boxplus \xi_+(cat) = (1, 1, 5, 3).$
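
The correspondence between the additive product and the matrix embedding can be checked numerically. The sketch below is illustrative only: the helper names `additive_product`, `xi_plus`, `embed` and `matmul` are our own, and the check is carried out on the vectors of Table 1 rather than in general.

```python
def additive_product(u, v):
    """Additive product on R^(n+1): (alpha*beta, alpha*v_i + beta*u_i)."""
    alpha, beta = u[0], v[0]
    tail = tuple(alpha * vi + beta * ui for ui, vi in zip(u[1:], v[1:]))
    return (alpha * beta,) + tail

def xi_plus(context_vector):
    """Prefix an n-dimensional context vector with the component 1."""
    return (1,) + tuple(context_vector)

def embed(u):
    """Matrix of order n+1 with u as first row and u[0] on the diagonal."""
    size = len(u)
    rows = [list(u)]
    for i in range(1, size):
        rows.append([u[0] if j == i else 0 for j in range(size)])
    return rows

def matmul(p, q):
    """Product of two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

big = xi_plus((1, 3, 0))
cat = xi_plus((0, 2, 3))
big_cat = additive_product(big, cat)
print(big_cat)  # (1, 1, 5, 3)

# The matrix product of the embeddings equals the embedding of the product.
print(matmul(embed(big), embed(cat)) == embed(big_cat))  # True
```

When the first component is 1, the remaining components of the product are just the sum of the two context vectors, which is why this algebra describes vector addition.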

Point-wise multiplication and addition are not attractive as methods for composing
meaning in natural language since they are commutative, whereas natural language
is inherently non-commutative. One obvious method of composing vectors that is not
commutative is the tensor product. This method of composition can be viewed as a
product in an algebra by considering the tensor algebra, which is formed from direct
sums of all tensor powers of a base vector space.

We assume the reader is familiar with the tensor product and direct sum (see
Halmos (1974) for definitions); we recall their basic properties here. Let $V_n$ denote a
vector space of dimensionality n (note that all vector spaces of a fixed dimensionality
are isomorphic). Then the tensor product space $V_n \otimes V_m$ is isomorphic to a space $V_{nm}$ of
dimensionality nm; moreover given orthonormal bases $B = \{b_1, b_2, \ldots, b_n\}$ for $V_n$ and
$C = \{c_1, c_2, \ldots, c_m\}$ for $V_m$ there is an orthonormal basis for $V_{nm}$ defined by

$\{b_i \otimes c_j : 1 \le i \le n \text{ and } 1 \le j \le m\}.$

Example 4

The multiplicative models of Mitchell and Lapata (2008) correspond to the class of finite
dimensional algebras. Let $\mathcal{A}$ be a finite-dimensional vector space. Then every associative
bilinear product on $\mathcal{A}$ can be described by a linear function $T$ from $\mathcal{A} \otimes \mathcal{A}$ to $\mathcal{A}$, as
required in Mitchell and Lapata's model. To see this, consider the action of the product $\cdot$
on two orthonormal basis vectors $a$ and $b$ of $\mathcal{A}$. This is a vector in $\mathcal{A}$, thus we can define
$T(a \otimes b) = a \cdot b$. By considering all basis vectors, we can define the linear function $T$.
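
The construction of $T$ from the products of basis vectors can be sketched as follows. This is an illustration under our own assumptions: the product is taken to be the point-wise product on $\mathbb{R}^3$, vectors are written in the standard basis, and the names `structure_constants` and `apply_T` are hypothetical.

```python
def basis(n, i):
    """The i-th standard basis vector of R^n."""
    return tuple(1 if k == i else 0 for k in range(n))

def pointwise(u, v):
    """Point-wise product, used here as the example bilinear product."""
    return tuple(a * b for a, b in zip(u, v))

def structure_constants(product, n):
    """Tabulate T(e_i tensor e_j) = e_i . e_j for all pairs of basis vectors."""
    return [[product(basis(n, i), basis(n, j)) for j in range(n)]
            for i in range(n)]

def apply_T(T, u, v):
    """Evaluate T on u tensor v by linearity: sum_ij u_i v_j T(e_i tensor e_j)."""
    n = len(u)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            coefficient = u[i] * v[j]
            for k in range(n):
                result[k] += coefficient * T[i][j][k]
    return tuple(result)

T = structure_constants(pointwise, 3)
print(apply_T(T, (1, 3, 0), (0, 2, 3)))  # (0, 6, 0)
```

The table `T` has $n^3$ entries, which is the number of parameters of a general linear map from the tensor square of an n-dimensional space back to that space.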

If the tensor product can loosely be viewed as "multiplying" vector spaces, then the
direct sum is like adding them; the space $V_n \oplus V_m$ has dimensionality n + m and has
basis vectors

$\{b_i \oplus 0 : 1 \le i \le n\} \cup \{0 \oplus c_j : 1 \le j \le m\};$

it is usual to write $b \oplus 0$ as $b$ and $0 \oplus c$ as $c$.



Example 5 (Tensor algebra)

If $V$ is a vector space, then we define $T(V)$, the free algebra or tensor algebra generated
by $V$ as:

$T(V) = \mathbb{R} \oplus V \oplus (V \otimes V) \oplus (V \otimes V \otimes V) \oplus \cdots$

where we assume that the direct sum is commutative. We can think of it as the direct
sum of all tensor powers of $V$, with $\mathbb{R}$ representing the zeroth power. In order to make
this space an algebra, we define the product on elements of these tensor powers, viewed
as subspaces of the tensor algebra, as their tensor product. This is enough to define the
product on the whole space, since every element can be written as a sum of tensor
powers of elements of $V$. There is a natural embedding from $V$ to $T(V)$, where each
element maps to an element in the first tensor power. Thus for example we can think of
$u$, $u \otimes v$, and $u \otimes v + w$ as elements of $T(V)$, for all $u, v, w \in V$.

Figure 2

Vector representations of the terms orange and fruit based on hypothetical occurrences in six
documents and their vector lattice meet (the darker shaded area). [Panels: orange; fruit;
orange $\wedge$ fruit.]

This product defines an algebra since the tensor product is a bilinear operation.
Taking $V = \mathbb{R}^3$ and using $\xi$ as the natural embedding from the context vector of a string
to $T(V)$, our previous example becomes

$\widehat{big\ cat} = \xi(big) \otimes \xi(cat)$
$= (1, 3, 0) \otimes (0, 2, 3)$
$= (1(0, 2, 3), 3(0, 2, 3), 0(0, 2, 3))$
$\cong (0, 2, 3, 0, 6, 9, 0, 0, 0)$

where the last two lines demonstrate how a vector in $\mathbb{R}^3 \otimes \mathbb{R}^3$ can be described in the
isomorphic space $\mathbb{R}^9$.
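
The flattening of the tensor product into $\mathbb{R}^9$ used in this example can be computed directly. The following sketch assumes the row-major ordering of components shown in the example; the function name `tensor` is our own.

```python
def tensor(u, v):
    """Tensor product of two vectors, flattened in row-major order.

    The component with index (i, j) is u[i] * v[j].
    """
    return tuple(ui * vj for ui in u for vj in v)

big = (1, 3, 0)
cat = (0, 2, 3)

print(tensor(big, cat))  # (0, 2, 3, 0, 6, 9, 0, 0, 0)
print(tensor(cat, big))  # (0, 0, 0, 2, 6, 0, 3, 9, 0)
```

The two orderings give different vectors, illustrating that this product is not commutative, in contrast with the point-wise and additive products.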

2.2 Vector lattices 

The next part of the definition specifies an abstract Lebesgue space. This is a special 
kind of vector lattice, or even more generally, a partially ordered vector space. 

Definition 3 (Partially ordered vector space) 

A partially ordered vector space $V$ is a real vector space together with a partial ordering
$\le$ such that:

if $x \le y$ then $x + z \le y + z$
if $x \le y$ then $\alpha x \le \alpha y$

for all $x, y, z \in V$, and for all $\alpha \ge 0$. Such a partial ordering is called a vector space order
on $V$. An element $u$ of $V$ satisfying $u \ge 0$ is called a positive element; the set of all
positive elements of $V$ is denoted $V^+$. If $\le$ defines a lattice on $V$ then the space is called
a vector lattice or Riesz space.



Example 6 (Lattice operations on $\mathbb{R}^n$)

A vector lattice captures many properties that are inherent in real vector spaces when
there is a distinguished basis. In $\mathbb{R}^n$, given a specific basis, we can write two vectors u and
v as sequences of numbers: $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$. This allows us to
define the lattice operations of meet $\wedge$ and join $\vee$ as

$u \wedge v = (\min(u_1, v_1), \min(u_2, v_2), \ldots, \min(u_n, v_n))$
$u \vee v = (\max(u_1, v_1), \max(u_2, v_2), \ldots, \max(u_n, v_n))$

i.e. the component-wise minimum and maximum, respectively. A graphical depiction
of the meet operation is shown in Figure 2.

The vector operations of addition and multiplication by scalar, which can be defined in 
a similar component-wise fashion, are nevertheless independent of the particular basis 
chosen. This makes them particularly suited to physical applications, where it is often 
a requirement that there is no preferred direction. Conversely, the lattice operations de- 
pend on the choice of basis, so the operations as defined above would behave differently 
if the components were written using a different basis. We argue that it makes sense for 
us to consider these properties of vectors in the context of computational linguistics 
since we can often have a distinguished basis: namely the one defined by the contexts 
in which terms occur. Of course it is true that techniques such as latent semantic analysis 
introduce a new basis which does not have a clear interpretation in relation to contexts; 
nevertheless they nearly always identify a distinguished basis which we can use to 
define the lattice operations. 
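
To make the lattice operations concrete, the following sketch computes the meet and join of two of the context vectors from Table 1, taking the contexts themselves as the distinguished basis. The function names `meet`, `join` and `leq` are our own.

```python
def meet(u, v):
    """Component-wise minimum of two vectors."""
    return tuple(min(a, b) for a, b in zip(u, v))

def join(u, v):
    """Component-wise maximum of two vectors."""
    return tuple(max(a, b) for a, b in zip(u, v))

def leq(u, v):
    """Partial order: u <= v exactly when every component of u is <= that of v."""
    return all(a <= b for a, b in zip(u, v))

cat = (0, 2, 3)
animal = (2, 1, 2)

print(meet(cat, animal))  # (0, 1, 2)
print(join(cat, animal))  # (2, 2, 3)
print(leq(cat, animal))   # False
print(leq(animal, cat))   # False
```

The two vectors are incomparable under the partial order, although both lie above their meet and below their join.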

We argue that the mere association of words with vectors is not enough to constitute 
a theory of meaning. Vector representations allow the measurement of similarity or
distance, through an inner product or metric; however, we believe it is also important
for a theory of meaning to model entailment, a relation which plays an important role in
logical theories of meaning. In propositional and first order logic, the entailment relation
is a partial ordering; in fact it is a Boolean algebra, which is a special kind of lattice. It
seems natural to consider whether the lattice structure that is inherent in the vector 
representations used in computational linguistics can be used to model entailment. 

We believe our framework is suited to all vector-based representations of natural 
language meaning, however the vectors are obtained. Given this assumption, we can 
only justify our assumption that the partial order structure of the vector space is suitable 
to represent the entailment relation by observing that it has the right kind of properties 
we would expect from this relation. 

There may, however, be more justification for this assumption, based on the case
where the vectors for terms are simply their frequencies of occurrences in n different
contexts, so that they are vectors in $\mathbb{R}^n$. In this case, the relation $\xi(x) \le \xi(y)$ means that
y occurs at least as frequently as x in every context. This means that y occurs in at
least as wide a range of contexts as x, and occurs at least as frequently as x. Thus the
statement "x entails y if and only if $\xi(x) \le \xi(y)$" can be viewed as a stronger form of the
distributional hypothesis of Harris (1968).

In fact, this idea can be related to the notion of "distributional generality", introduced
by Weeds, Weir, and McCarthy (2004) (see also Geffet and Dagan 2005). A term
x is distributionally more general than another term y if y occurs in a subset of the
contexts that x occurs in. The idea is that distributional generality may be connected to
semantic generality. An example of this is the hypernymy or "is a" relation that is used to
express generality of concepts in ontologies, for example, the term animal is a hypernym
of dog since a dog is an animal. They explain the connection to distributional generality
as follows:

Although one can obviously think of counter-examples, we would generally expect that
the more specific term dog can only be used in contexts where animal can be used and
that the more general term animal might be used in all of the contexts where dog is used
and possibly others. Thus, we might expect that distributional generality is correlated
with semantic generality. . .

Our proposal, in the case where words are represented by frequency vectors, can 
be considered a stronger version of distributional generality, where the additional re- 
quirement is on the frequency of occurrences. In practice, this assumption is unlikely 
to be compatible with the ontological view of entailment. For example, the term entity is
semantically more general than the term animal; however, entity is unlikely to occur more
frequently in each context, since it is a rarer word. A more realistic foundation for this
assumption might be if we were to consider the components for a word to represent the 
plausibility of observing the word in each context. The question then of course, is how 
such vectors might be obtained. Another possibility is to attempt to weight components 
in such a way that entailment becomes a plausible interpretation for the partial ordering 
relation. 

Even if we allow for such alternatives, however, in general it is unlikely that the
relation will hold between any two strings, since $u \le v$ if and only if $u_i \le v_i$ for each
component, $u_i$, $v_i$, of the two vectors. Instead, we propose to allow for degrees of
entailment. We take a Bayesian perspective on this, and suggest that the degree of entailment
should take the form of a conditional probability. In order to define this, however, we 
need some additional structure on the vector lattice that allows it to be viewed as a 
description of probability, by requiring it to be an "abstract Lebesgue space". 

Definition 4 (Banach lattice)

A Banach lattice $V$ is a vector lattice together with a norm $\|\cdot\|$ such that $V$ is complete
with respect to $\|\cdot\|$.

Definition 5 (Abstract Lebesgue Space)

An Abstract Lebesgue (or AL) space is a Banach lattice $V$ such that

$\|u + v\| = \|u\| + \|v\|$

for all $u, v$ in $V$ with $u \ge 0$, $v \ge 0$ and $u \wedge v = 0$.

Example 7 ($\ell^p$ spaces)

Let $u = (u_1, u_2, \ldots)$ be an infinite sequence of real numbers. We can view $u_i$ as components
of the infinite-dimensional vector $u$. We call the set of all such vectors the sequence
space; it is a vector space where the operations are defined component-wise. We define
a set of norms, the $\ell^p$-norms, on the space of all such vectors by

$\|u\|_p = \left( \sum_i |u_i|^p \right)^{1/p}$

The space of all vectors $u$ for which $\|u\|_p$ is finite is called the $\ell^p$ space. Considered
as vector spaces, these are Banach spaces, since they are complete with respect to the
associated norm, and under the component-wise lattice operations, they are Banach
lattices. In particular, the $\ell^1$ space is an abstract Lebesgue space under the $\ell^1$ norm.

The finite-dimensional real vector spaces $\mathbb{R}^n$ can be considered as special cases of
the sequence spaces (consisting of vectors in which all but n components are zero) and,
since they are finite-dimensional, we can use any of the $\ell^p$ norms. Thus, our previous
examples, in which $\xi$ mapped terms to vectors in $\mathbb{R}^n$ can be considered as mapping to
abstract Lebesgue spaces, if we adopt the $\ell^1$ norm.
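
The additivity condition of Definition 5 can be checked numerically in the finite-dimensional case. The sketch below uses two hypothetical positive vectors with disjoint support, so that their meet is zero, and compares the behaviour of the $\ell^1$ and $\ell^2$ norms; the function names are our own.

```python
def lp_norm(u, p):
    """The l^p norm of a finite vector: (sum_i |u_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def meet(u, v):
    """Component-wise minimum of two vectors."""
    return tuple(min(a, b) for a, b in zip(u, v))

def add(u, v):
    """Component-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))

# Hypothetical positive vectors with disjoint support, so that u meet v = 0.
u = (1, 0, 2)
v = (0, 3, 0)
print(meet(u, v))  # (0, 0, 0)

# Under the l^1 norm the norm is additive on such pairs.
print(lp_norm(add(u, v), 1), lp_norm(u, 1) + lp_norm(v, 1))  # 6.0 6.0

# Under the l^2 norm it is not: the left value is strictly smaller.
print(lp_norm(add(u, v), 2) < lp_norm(u, 2) + lp_norm(v, 2))  # True
```

This is consistent with the choice of the $\ell^1$ norm when the finite-dimensional examples are to be treated as abstract Lebesgue spaces.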

2.3 Degrees of entailment 

An abstract Lebesgue space has many of the properties of a measure space, where the 
set operations of a measure space are replaced by the lattice operations of the vector 
space. This means that we can think of an abstract Lebesgue space as a vector-based 
probability space. Here, events correspond to positive elements with norm less than or
equal to 1; the probability of an event $u$ is given by the norm (which we shall always
assume is the $\ell^1$ norm), and the joint probability of two events $u$ and $v$ is $\|u \wedge v\|_1$.

Definition 6 (Degree of entailment)

We consider the degree to which $u$ entails $v$ to be the conditional probability of $v$ given
$u$:

$\mathrm{Ent}(u, v) = \frac{\|u \wedge v\|_1}{\|u\|_1}.$

If we are only interested in degrees of entailment (i.e. conditional probabilities) and 
not probabilities, then we can drop the requirement that the norm should be less 
than or equal to one, since conditional probabilities are automatically normalised. This 
definition, together with the multiplication of the algebra, allows us to compute the 
degree of entailment between any two strings according to the context theory. 

Example 8

The vectors given in Table 1 give the following calculation for the degree to which cat
entails animal:

$\xi(cat) = (0, 2, 3)$
$\xi(animal) = (2, 1, 2)$
$\xi(cat) \wedge \xi(animal) = (0, 1, 2)$
$\mathrm{Ent}(\xi(cat), \xi(animal)) = \|\xi(cat) \wedge \xi(animal)\|_1 / \|\xi(cat)\|_1 = 3/5$
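
The calculation of Example 8 can be reproduced as follows. This is a minimal sketch for positive vectors with integer components, using exact fractions; the dictionary `XI` transcribes Table 1 and the function names are our own.

```python
from fractions import Fraction

# Context vectors transcribed from Table 1.
XI = {"cat": (0, 2, 3), "animal": (2, 1, 2), "big": (1, 3, 0)}

def meet(u, v):
    """Component-wise minimum of two vectors."""
    return tuple(min(a, b) for a, b in zip(u, v))

def l1_norm(u):
    """The l^1 norm: sum of absolute values of the components."""
    return sum(abs(x) for x in u)

def ent(u, v):
    """Degree to which u entails v: ||u meet v||_1 / ||u||_1 (u must be non-zero)."""
    return Fraction(l1_norm(meet(u, v)), l1_norm(u))

print(ent(XI["cat"], XI["animal"]))  # 3/5
print(ent(XI["big"], XI["cat"]))     # 1/2
print(ent(XI["cat"], XI["big"]))     # 2/5
```

The last two lines show that the degree of entailment is not symmetric in general, as expected of a conditional probability.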

An important question is how this context-theoretic definition of the degree of
entailment relates to more familiar notions of entailment.² There are three main ways in
which the term entailment is used:

• The model-theoretic sense of entailment in which a theory A entails a
theory D if every model of A is also a model of D. It was shown in
(Clarke 2007) that this type of entailment can be described using context
theories, where sentences are represented as projections on a vector space.

• Entailment between terms in the WordNet hierarchy, for example the
hypernymy or is-a relation between the terms cat and animal encodes the
fact that a cat is an animal. In (Clarke 2007) we showed that such relations
can be encoded in the partial order structure of a vector lattice.

• Human common-sense judgments as to whether one sentence entails or
implies another sentence, as used in the Recognising Textual Entailment
Challenges (Dagan, Glickman, and Magnini 2005).

2 Thanks are due to the anonymous reviewer who identified this question and related issues.



Our context-theoretic notion of entailment is thus intended to generalise both the first 
two senses of entailment above. In addition, we hope that context theories will be useful 
in the practical application of recognising textual entailment. 

Our definition is more general than the model-theoretic and hypernymy notions of
entailment, however, as it allows the measurement of a degree of entailment between
any two strings: as an extreme example, one may measure the degree to which "not a" 
entails "in the". Whilst this may not be useful or philosophically meaningful, we view 
it as a practical consequence of the fact that every string has a vector representation in 
our model, which coincides with the current practice in vector-based compositionality 
techniques. 



2.4 Lattice ordered algebras 



A large class of context theories make use of a lattice ordered algebra which merges the 
lattice ordering of the vector space V with the product of A. 



Definition 7 (Partially ordered algebra) 

A partially ordered algebra $\mathcal{A}$ is an algebra which is also a partially ordered vector
space, which satisfies $u \cdot v \ge 0$ for all $u, v \in \mathcal{A}^+$. If the partial ordering is a lattice, then
$\mathcal{A}$ is called a lattice ordered algebra.



Example 9 (Lattice ordered algebra of matrices) 

The matrices of order n form a lattice ordered algebra under normal matrix multi- 
plication, where the lattice operations are defined as the entry-wise minimum and 
maximum. 



Example 10 (Operators on $\ell^p$ spaces)

Operators on the $\ell^p$ spaces are also lattice ordered algebras, by the Riesz-Kantorovich
theorem (Abramovich and Aliprantis 2002), with the operations defined by:

$(S \vee T)(u) = \sup\{S(v) + T(w) : v, w \in U^+ \text{ and } v + w = u\}$
$(S \wedge T)(u) = \inf\{S(v) + T(w) : v, w \in U^+ \text{ and } v + w = u\}$



If $\mathcal{A}$ is a lattice ordered algebra which is also an abstract Lebesgue space, then
$\langle A, \mathcal{A}, \xi, \mathcal{A}, 1 \rangle$ is a context theory. Many of the examples we discuss will be of this
form, so we will use the shorthand notation, $\langle A, \mathcal{A}, \xi \rangle$. It is tempting to adopt this as
the definition of context theory; however, as we will see, this is not supported by our
prototypical example of a context theory (which we will introduce in the next section)
as in this case the algebra is not necessarily lattice ordered.

3. Context Algebras

In this section we describe the prototypical examples of a context theory, the context 
algebras. The definition of a context algebra originates in the idea that the notion of 
"meaning as context" can be extended beyond the word level to strings of arbitrary 
length. In fact, the notion of context algebra can be thought of as a generalisation of the 
syntactic monoid of a formal language: instead of a set of strings defining the language, 
we have a fuzzy set of strings, or more generally, a real-valued function on a free monoid. 

Definition 8 (Real-valued language)

Let $A$ be a finite set of symbols. A real-valued language (or simply a language when
there is no ambiguity) $L$ on $A$ is a function from $A^*$ to $\mathbb{R}$. If the range of $L$ is a subset of
$\mathbb{R}^+$ then $L$ is called a positive language. If the range of $L$ is a subset of $[0, 1]$ then $L$ is
called a fuzzy language. If $L$ is a positive language such that $\sum_{x \in A^*} L(x) = 1$ then $L$ is
a probability distribution over $A^*$.

The following inclusion relation applies amongst these classes of language:

distribution $\Rightarrow$ fuzzy $\Rightarrow$ positive $\Rightarrow$ real-valued

Since $A^*$ is a countable set, the set $\mathbb{R}^{A^*}$ of functions from $A^*$ to $\mathbb{R}$ is isomorphic
to the sequence space, and we shall treat them equivalently. We denote by $\ell^p(A^*)$ the
set of functions with a finite $\ell^p$ norm, when considered as sequences. There is another
hierarchy of spaces given by the inclusion of the $\ell^p$ spaces: $\ell^p(A^*) \subseteq \ell^q(A^*)$ if $p < q$. In
particular,

$$\ell^1(A^*) \subseteq \ell^2(A^*) \subseteq \ell^\infty(A^*) \subseteq \mathbb{R}^{A^*}$$

where the $\ell^\infty$ norm gives the supremum of the absolute values of the function and
$\ell^\infty(A^*)$ is the space of all bounded functions on $A^*$.

Note that probability distributions are in $\ell^1(A^*)$ and fuzzy languages are in $\ell^\infty(A^*)$.
If $L \in \ell^1(A^*)^+$ (the space of positive functions on $A^*$ such that the sum of all values
of the function is finite) then we can define a probability distribution $p_L$ over $A^*$ by
$p_L(x) = L(x)/\|L\|_1$. Similarly, if $L \in \ell^\infty(A^*)^+$ (the space of bounded positive functions
on $A^*$) then we can define a fuzzy language $f_L$ by $f_L(x) = L(x)/\|L\|_\infty$.
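The two normalisations above can be illustrated with a small Python sketch. It assumes a finitely supported positive language stored as a dictionary from strings to values; the function names and the toy counts are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def to_distribution(lang):
    """Return p_L(x) = L(x) / ||L||_1 for a finitely supported positive language."""
    total = sum(lang.values())      # ||L||_1, since all values are non-negative
    return {x: v / total for x, v in lang.items()}


def to_fuzzy(lang):
    """Return f_L(x) = L(x) / ||L||_inf for a bounded positive language."""
    biggest = max(lang.values())    # ||L||_inf on a finite support
    return {x: v / biggest for x, v in lang.items()}


# Hypothetical toy language: raw counts of three strings.
counts = {"ab": Fraction(3), "abc": Fraction(1), "b": Fraction(4)}
p = to_distribution(counts)
f = to_fuzzy(counts)
```

Both functions assume the language is not identically zero, since otherwise the norm in the denominator vanishes.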

Example 11

Given a finite set of strings $C \subset A^*$, which we may imagine to be a corpus of documents,
define $L(x) = 1/|C|$ if $x \in C$, or 0 otherwise. Then $L$ is a probability distribution over $A^*$.

Example 12

Let $L$ be a language such that $L(x) = 0$ for all but a finite subset of $A^*$. Then $L \in \ell^p(A^*)$
for all $p$.

Example 13

Let $L$ be the language defined by $L(x) = |x|$ where $|x|$ is the length of (i.e. number of
symbols in) string $x$. Then $L$ is a positive language which is not bounded: for any string
$y$ there exists a $z$ such that $L(z) > L(y)$, for example $z = ay$ for $a \in A$.

Example 14

Let $L$ be the language defined by $L(x) = 1/2$ for all $x$. Then $L$ is a fuzzy language but
$L \notin \ell^1(A^*)$.

We will assume now that L is fixed, and consider the properties of contexts of strings 
with respect to this language. 

Definition 9 (Context vectors)

Let $L$ be a language on $A$. For $x \in A^*$, we define the context of $x$ as a vector
$\hat{x} \in \mathbb{R}^{A^* \times A^*}$, i.e. a real-valued function on pairs of strings:

$$\hat{x}(y, z) = L(yxz).$$

Our thesis is centred around these vectors, and it is their properties that form the 
inspiration for the context-theoretic framework. 
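As an informal illustration of Definition 9, the following Python sketch computes the non-zero entries of a context vector for a finitely supported language. The dictionary representation and the name `context_vector` are assumptions made for this sketch; the toy language is the three-string language used in Example 15.

```python
from fractions import Fraction


def context_vector(lang, x):
    """Context of x: maps each pair (y, z) to L(yxz), keeping non-zero entries only."""
    vec = {}
    for s, value in lang.items():
        # Each occurrence of x inside s gives exactly one pair (y, z) with yxz = s.
        for i in range(len(s) - len(x) + 1):
            if s[i:i + len(x)] == x:
                vec[(s[:i], s[i + len(x):])] = value
    return vec


THIRD = Fraction(1, 3)
LANG = {"abcd": THIRD, "aecd": THIRD, "abfd": THIRD}

b_hat = context_vector(LANG, "b")     # non-zero on (a, cd) and (a, fd)
bc_hat = context_vector(LANG, "bc")   # non-zero on (a, d) only
```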

The question we are addressing is: does there exist some algebra $\mathcal{A}$ containing the
context vectors of strings in $A^*$ such that $\hat{x} \cdot \hat{y} = \widehat{xy}$, where $x, y \in A^*$ and $\cdot$ indicates
multiplication in the algebra? As a first try, consider the vector space $L^\infty(A^* \times A^*)$ in
which the context vectors live. Is it possible to define multiplication on the whole vector
space such that the condition just specified holds?

Example 15

Consider the language $C$ on the alphabet $A = \{a, b, c, d, e, f\}$ defined by $C(abcd) =
C(aecd) = C(abfd) = \frac{1}{3}$ and $C(x) = 0$ for all other $x \in A^*$. Now if we take the shorthand
notation of writing the basis vector in $L^\infty(A^* \times A^*)$ corresponding to a pair of strings as
the pair of strings itself then

$$\hat{b} = \tfrac{1}{3}(a, cd) + \tfrac{1}{3}(a, fd)$$
$$\hat{c} = \tfrac{1}{3}(ab, d) + \tfrac{1}{3}(ae, d)$$
$$\widehat{bc} = \tfrac{1}{3}(a, d)$$

It would thus seem sensible to define multiplication of contexts so that
$\tfrac{1}{3}(a, cd) \cdot \tfrac{1}{3}(ab, d) = \tfrac{1}{3}(a, d)$. However we then find

$$\hat{e} \cdot \hat{f} = \tfrac{1}{3}(a, cd) \cdot \tfrac{1}{3}(ab, d) \neq \widehat{ef} = 0$$

showing that this definition of multiplication doesn't provide us with what we are
looking for. In fact, if there did exist a way to define multiplication on contexts in
a satisfactory manner it would necessarily be far from intuitive, as, in this example,
we would have to define $(a, cd) \cdot (ab, d) = 0$, meaning the product $\hat{b} \cdot \hat{c}$ would have to
have a non-zero component derived from the products of context vectors $(a, fd)$ and
$(ae, d)$ which don't relate at all to the contexts of $bc$. This leads us to instead define
multiplication on a subspace of $L^\infty(A^* \times A^*)$.

Definition 10 (Generated Subspace $\mathcal{A}$)

The subspace $\mathcal{A}$ of $L^\infty(A^* \times A^*)$ is the set defined by

$$\mathcal{A} = \Big\{a : a = \sum_{x \in A^*} \alpha_x \hat{x} \text{ for some } \alpha_x \in \mathbb{R}\Big\}$$

Because of the way we define the subspace, there will always exist some basis
$\mathcal{B} = \{\hat{u} : u \in B\}$ where $B \subseteq A^*$, and we can define multiplication on this basis by
$\hat{u} \cdot \hat{v} = \widehat{uv}$ where $u, v \in B$. Defining multiplication on the basis defines it for the whole vector
subspace, since we define multiplication to be linear, making $\mathcal{A}$ an algebra.

However there are potentially many different bases we could choose, each corre- 
sponding to a different subset of A*, and each giving rise to a different definition of 
multiplication. Remarkably, this isn't a problem: 

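Proposition (Context Algebra)

Multiplication on $\mathcal{A}$ is the same irrespective of the choice of basis $\mathcal{B}$.

Proof

We say $B \subseteq A^*$ defines a basis $\mathcal{B}$ for $\mathcal{A}$ when $\mathcal{B}$ is a basis such that $\mathcal{B} = \{\hat{x} : x \in B\}$.
Assume there are two sets $B_1, B_2 \subseteq A^*$ that define corresponding bases $\mathcal{B}_1$ and $\mathcal{B}_2$ for
$\mathcal{A}$. We will show that multiplication in basis $\mathcal{B}_1$ is the same as in the basis $\mathcal{B}_2$.

We represent two basis elements $\hat{u}_1$ and $\hat{u}_2$ of $\mathcal{B}_1$ in terms of basis elements of $\mathcal{B}_2$:

$$\hat{u}_1 = \sum_i \alpha_i \hat{v}_i \quad \text{and} \quad \hat{u}_2 = \sum_j \beta_j \hat{v}_j,$$

for some $u_i \in B_1$, $v_j \in B_2$ and $\alpha_i, \beta_j \in \mathbb{R}$. First consider multiplication in the basis
$\mathcal{B}_1$. Note that $\hat{u}_1 = \sum_i \alpha_i \hat{v}_i$ means that $L(x u_1 y) = \sum_i \alpha_i L(x v_i y)$ for all $x, y \in A^*$. This
includes the special case where $y = u_2 y'$ so

$$L(x u_1 u_2 y') = \sum_i \alpha_i L(x v_i u_2 y')$$

for all $x, y' \in A^*$. Similarly, we have $L(x u_2 y) = \sum_j \beta_j L(x v_j y)$ for all $x, y \in A^*$ which
includes the special case $x = x' v_i$, so $L(x' v_i u_2 y) = \sum_j \beta_j L(x' v_i v_j y)$ for all $x', y \in A^*$.
Inserting this into the above expression yields

$$L(x u_1 u_2 y) = \sum_{i,j} \alpha_i \beta_j L(x v_i v_j y)$$

for all $x, y \in A^*$ which we can rewrite as

$$\hat{u}_1 \cdot \hat{u}_2 = \widehat{u_1 u_2} = \sum_{i,j} \alpha_i \beta_j (\hat{v}_i \cdot \hat{v}_j) = \sum_{i,j} \alpha_i \beta_j \widehat{v_i v_j}.$$

Conversely, the product of $\hat{u}_1$ and $\hat{u}_2$ using the basis $\mathcal{B}_2$ is

$$\hat{u}_1 \cdot \hat{u}_2 = \sum_i \alpha_i \hat{v}_i \cdot \sum_j \beta_j \hat{v}_j = \sum_{i,j} \alpha_i \beta_j (\hat{v}_i \cdot \hat{v}_j)$$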
thus showing that multiplication is defined independently of what we choose as the
basis. ■

Example 16

Returning to the previous example, we can see that in this case multiplication is in fact
defined on $L^\infty(A^* \times A^*)$ since we can describe each basis vector in terms of context
vectors:

$$(a, fd) \cdot (ae, d) = 3(\hat{b} - \hat{e}) \cdot 3(\hat{c} - \hat{f}) = -3(a, d)$$
$$(a, cd) \cdot (ae, d) = 3\hat{e} \cdot 3(\hat{c} - \hat{f}) = 3(a, d)$$
$$(a, fd) \cdot (ab, d) = 3(\hat{b} - \hat{e}) \cdot 3\hat{f} = 3(a, d)$$
$$(a, cd) \cdot (ab, d) = 3\hat{e} \cdot 3\hat{f} = 0,$$

thus confirming what we predicted about the product of $\hat{b}$ and $\hat{c}$: the value is only correct
because of the negative correction from $(a, fd) \cdot (ae, d)$. This example also serves to
demonstrate an important property of context algebras: they do not satisfy the positivity
condition; i.e. it is possible for positive vectors (those with all components greater than
or equal to zero) to have a non-positive product. This means they are not necessarily
partially ordered algebras under the normal partial order. Compare this to the case of
matrix multiplication, for example, where the product of two positive matrices is always
positive.
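The four products in Example 16 can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: elements of the algebra are written as formal combinations of context vectors of strings (a dictionary from strings to coefficients), multiplication is the bilinear extension of concatenation, and `evaluate` maps a combination back to its components on pairs of strings. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

THIRD = Fraction(1, 3)
LANG = {"abcd": THIRD, "aecd": THIRD, "abfd": THIRD}


def context_vector(x):
    """Non-zero entries of the context vector of x with respect to LANG."""
    vec = {}
    for s, value in LANG.items():
        for i in range(len(s) - len(x) + 1):
            if s[i:i + len(x)] == x:
                vec[(s[:i], s[i + len(x):])] = value
    return vec


def evaluate(combo):
    """Turn a formal combination {string: coeff} into a vector over (y, z) pairs."""
    out = defaultdict(Fraction)
    for x, coeff in combo.items():
        for key, value in context_vector(x).items():
            out[key] += coeff * value
    return {k: v for k, v in out.items() if v != 0}


def multiply(u, v):
    """Bilinear extension of the rule: context of x times context of y = context of xy."""
    out = defaultdict(Fraction)
    for x, a in u.items():
        for y, b in v.items():
            out[x + y] += a * b
    return dict(out)


# Basis vectors expressed through context vectors, as in Example 16.
a_fd = {"b": Fraction(3), "e": Fraction(-3)}   # (a, fd) = 3(b^ - e^)
ae_d = {"c": Fraction(3), "f": Fraction(-3)}   # (ae, d) = 3(c^ - f^)
a_cd = {"e": Fraction(3)}                      # (a, cd) = 3 e^
ab_d = {"f": Fraction(3)}                      # (ab, d) = 3 f^

negative_product = evaluate(multiply(a_fd, ae_d))
```

Here `negative_product` has a single component on `("a", "d")`, which is negative even though both factors are positive vectors.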

The notion of a context theory is founded on the prototypical example given by
context vectors. So far we have shown that multiplication can be defined on the vector
space $\mathcal{A}$ generated by context vectors of strings, however we have not discussed the
lattice properties of the vector space. In fact, $\mathcal{A}$ does not come with a natural lattice
ordering that makes sense for our purposes; however, the original space $\mathbb{R}^{A^* \times A^*}$ does:
it is isomorphic to the sequence space. Thus $(A, \mathcal{A}, \xi, \mathbb{R}^{A^* \times A^*}, \psi)$ will form our context
theory, where $\xi(a) = \hat{a}$ for $a \in A$ and $\psi$ is the canonical map which simply maps
elements of $\mathcal{A}$ to themselves, but considered as elements of $\mathbb{R}^{A^* \times A^*}$. There is an important
caveat here however: we required that the vector lattice be an abstract Lebesgue space,
which means we need to be able to define a norm on it. The $\ell^1$ norm on $\mathbb{R}^{A^* \times A^*}$ is an
obvious candidate; however, it is not guaranteed to be finite. This is where the nature of
the underlying language $L$ becomes important.

We might hope that the most restrictive class of the languages we discussed, the 
probability distributions over A* would guarantee that the norm is finite. Unfortunately, 
this is not the case, as the following example demonstrates. 

Example 17

Let $L$ be the language defined by

$$L(a^{2^n}) = 1/2^{n+1}$$

for integer $n \ge 0$, and zero otherwise, where by $a^n$ we mean $n$ repetitions of $a$, so for
example, $L(a) = \frac{1}{2}$, $L(aa) = \frac{1}{4}$, $L(aaa) = 0$ and $L(aaaa) = \frac{1}{8}$. Then $L$ is a probability
distribution over $A^*$, since $L$ is positive and $\|L\|_1 = 1$. However $\|\hat{a}\|_1$ is infinite, since
each string $x$ for which $L(x) > 0$ contributes $1/2$ to the value of the norm, and there are
an infinite number of such strings.
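The divergence in Example 17 can be seen numerically. The following Python sketch (the function name is hypothetical) accumulates the first few terms of $\|L\|_1$ and of $\|\hat{a}\|_1$, using the fact that the string $a^{2^n}$ contains $2^n$ occurrences of the symbol $a$.

```python
from fractions import Fraction


def partial_sums(terms):
    """Partial sums of ||L||_1 and of the norm of the context of a, over a^(2^n), n < terms."""
    mass = Fraction(0)
    context_norm = Fraction(0)
    for n in range(terms):
        length = 2 ** n
        value = Fraction(1, 2 ** (n + 1))   # L(a^(2^n))
        mass += value
        # The string a^(2^n) contains `length` occurrences of the symbol a,
        # so it contributes value * length = 1/2 to the context norm.
        context_norm += value * length
    return mass, context_norm


mass, norm = partial_sums(10)
```

After ten terms the probability mass is within $2^{-10}$ of 1, while the context norm has already reached 5 and grows by $1/2$ with every further term.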

The problem in the previous example is that the average string length is infinite.
If we restrict ourselves to probability distributions over $A^*$ in which the average string
length is finite, then the problem goes away.

Proposition

Let $L$ be a probability distribution over $A^*$ such that

$$\bar{L} = \sum_{x \in A^*} L(x)|x|$$

is finite, where $|x|$ is the number of symbols in string $x$; we will call such languages finite
average length languages. Then $\|\hat{y}\|_1$ is finite for each $y \in A^*$.

Proof

Denote the number of occurrences of string $y$ as a substring of string $x$ by $|x|_y$. Clearly
$|x|_y \le |x|$ for all $x, y \in A^*$. Moreover,

$$\|\hat{y}\|_1 = \sum_{x \in A^*} L(x)|x|_y \le \sum_{x \in A^*} L(x)|x|$$

and so $\|\hat{y}\|_1 \le \bar{L}$ is finite for all $y \in A^*$. ■
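The bound in the proof can be checked on a small example. This Python sketch is an illustration only: the toy distribution and the function names are hypothetical, and occurrences are counted with overlaps for non-empty substrings.

```python
from fractions import Fraction


def occurrences(x, y):
    """Number of (possibly overlapping) occurrences of the non-empty string y in x."""
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)


def context_norm(lang, y):
    """l1 norm of the context vector of y, computed as the sum of L(x) |x|_y."""
    return sum(v * occurrences(x, y) for x, v in lang.items())


def average_length(lang):
    """Average string length: the sum of L(x) |x|."""
    return sum(v * len(x) for x, v in lang.items())


# Hypothetical probability distribution over three strings.
lang = {"abab": Fraction(1, 2), "ab": Fraction(1, 4), "b": Fraction(1, 4)}
bound = average_length(lang)          # 11/4
norm_ab = context_norm(lang, "ab")    # 5/4
```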

If $L$ is finite average length, then $\mathcal{A} \subseteq \ell^1(A^* \times A^*)$, and so
$(A, \mathcal{A}, \xi, \ell^1(A^* \times A^*), \psi)$ is a context theory, where $\psi$ is the canonical map from
$\mathcal{A}$ to $\ell^1(A^* \times A^*)$. Thus context algebras of finite average length languages provide
our prototypical examples of context theories.

3.1 Discussion 

The benefit of the context-theoretic framework is in providing a space of exploration for 
models of meaning in language. Our effort has been in finding principles by which to 
define the boundaries of this space. Each of the key boundaries, namely, bilinearity and 
associativity of multiplication, and entailment through vector lattice structure, can also 
be viewed as limitations of the model. 

Bilinearity is a strong requirement to place, and has wide-ranging implications for 
the way meaning is represented in the model. It can be interpreted loosely as follows: 
components of meaning persist or diminish but do not spontaneously appear. This is 
particularly counter-intuitive in the case of idiom and metaphor in language. It means 
that, for example, both red and herring must contain some components relating to the 
meaning of red herring which only come into play when these two words are combined 
in this particular order. Any other combination would give a zero product for these 
components. It is easy to see how this requirement arises from a context-theoretic 
perspective, nevertheless from a linguistic perspective it is arguably undesirable. 

One potential limitation of the model is that it does not explicitly model syntax, but 
rather syntactic restrictions are encoded into the vector space and product itself. For 
example, we may assume the word square has some component of meaning in common
with the word shape. Then we would expect this component to be preserved in the
sentences He drew a square and He drew a shape. However, in the case of the two sentences
The box is square and *The box is shape we would expect the second to be represented
by the zero vector since it is not grammatical; square can be a noun and an adjective,
whereas shape cannot be an adjective. Distributivity of meaning means that the
component of meaning that square has in common with shape must be disjoint with the
adjectival component of the meaning of square.

Associativity is also a very strong requirement to place; indeed Lambek (1961)
introduced non-associativity into his calculus precisely to deal with examples that were
not satisfactorily dealt with by his associative model (Lambek 1958).

Whilst we hope that these features or boundaries are useful in their current form, 
it may be that with time, or for certain applications there is a reason to expand or 
contract certain of them, perhaps because of theoretical discoveries relating to the model 
of meaning as context, or for practical or linguistic reasons, if, for example, the model is 
found to be too restrictive to model certain linguistic phenomena. 

4. Applications to Textual Entailment 

The only existing framework for textual entailment that we are aware of is that of 
Glickman and Dagan (2005). However this framework does not seem to be general 
enough to deal satisfactorily with many techniques used to tackle the problem since 
it requires interpreting the hypothesis as a logical statement. 

Conversely, systems that use logical representations of language are often imple- 
mented without reference to any framework, and thus deal with the problems of repre- 
senting the ambiguity and uncertainty that is inherent in handling natural language in 
an ad-hoc fashion. 

Thus it seems what is needed is a framework which is general enough to satisfacto- 
rily incorporate purely statistical techniques and logical representations, and in addition 
provide guidance as to how to deal with ambiguity and uncertainty in natural language. 
It is this that we hope our context-theoretic framework will provide. 

In this section we analyse approaches to the textual entailment problem, showing 
how they can be related to the context-theoretic framework, and discussing potential 
new approaches that are suggested by looking at them within the framework. We first 
discuss some simple approaches to textual entailment based on subsequence matching 
and measuring lexical overlap. We then look at how Glickman and Dagan's approach 
can be considered as a context theory in which words are represented as projections 
on the vector space of documents. This leads us to an implementation of our own in 
which we used latent Dirichlet allocation as an alternative approach to overcoming the 
problem of data sparseness. 

4.1 Subsequence Matching and Lexical Overlap 

We call a sequence $x \in A^*$ a "subsequence" of $y \in A^*$ if each element of $x$ occurs in
$y$ in the same order, but with the possibility of other elements occurring in between,
so for example $abba$ is a subsequence of $acabcba$ in $\{a, b, c\}^*$. Subsequence matching
compares the subsequences of two sequences: the more subsequences they have in
common the more similar they are assumed to be. This idea has been used successfully
in text classification (Lodhi et al. 2002) and also formed the basis of the author's entry
to the second Recognising Textual Entailment Challenge (Clarke 2006).

If $S$ is a semigroup, $\ell^1(S)$ is a lattice ordered algebra under the multiplication of
convolution:

$$(f \cdot g)(x) = \sum_{yz = x} f(y) g(z)$$

where $x, y, z \in S$, $f, g \in \ell^1(S)$.

Example 18 (Subsequence matching)

Consider the algebra $\ell^1(A^*)$ for some alphabet $A$. This has a basis consisting of elements
$e_x$ for $x \in A^*$, where $e_x$ is the function that is 1 on $x$ and 0 elsewhere. In particular $e_\epsilon$ is
a unity for the algebra. Define $\xi(a) = \frac{1}{2}(e_a + e_\epsilon)$; then $(A, \ell^1(A^*), \xi)$ is a context theory.
Under this context theory, a sequence $x$ completely entails $y$ if and only if it is a
subsequence of $y$. In our experiments, we have shown that this type of context theory can
perform significantly better than straightforward lexical overlap (Clarke 2006). Many
variations on this idea are possible, for example using more complex mappings from $A^*$
to $\ell^1(A^*)$.
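To make the convolution product and the map $\xi$ of Example 18 concrete, here is a short Python sketch. It assumes finitely supported functions on $A^*$ stored as dictionaries from strings to weights, with the empty string playing the role of $\epsilon$; the names `convolve`, `xi` and `represent` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(f, g):
    """(f . g)(x) = sum of f(y) g(z) over all factorisations x = yz."""
    out = defaultdict(Fraction)
    for y, fy in f.items():
        for z, gz in g.items():
            out[y + z] += fy * gz
    return dict(out)


def xi(symbol):
    """xi(a) = 1/2 (e_a + e_epsilon), with the empty string as epsilon."""
    half = Fraction(1, 2)
    return {symbol: half, "": half}


def represent(string):
    """Product of xi over the symbols of `string`; e_epsilon is the unit."""
    vec = {"": Fraction(1)}
    for symbol in string:
        vec = convolve(vec, xi(symbol))
    return vec


ab = represent("ab")   # weight 1/4 on each of "", "a", "b", "ab"
```

The support of `represent(s)` is exactly the set of subsequences of `s`, which is what ties this representation to subsequence matching.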

Example 19 (Lexical overlap)

The simplest approach to textual entailment is to measure the degree of lexical
overlap: the proportion of words in the hypothesis sentence that are contained in the text
sentence (Dagan, Glickman, and Magnini 2005). This approach can be described as a
context theory in terms of a free commutative semigroup on a set $A$, defined by $A^*/\equiv$
where $x \equiv y$ in $A^*$ if the symbols making up $x$ can be reordered to make $y$. Then
define $\xi'$ by $\xi'(a) = \frac{1}{2}(e_{[a]} + e_{[\epsilon]})$ where $[a]$ is the equivalence class of $a$ in $A^*/\equiv$. Then
$(A, \ell^1(A^*/\equiv), \xi')$ is a context theory in which entailment is defined by lexical overlap.
More complex definitions of $\hat{x}$ can be used, for example to weight different words by
their probabilities.
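The lexical overlap measure itself is simple to compute directly. The sketch below is a minimal illustration; splitting on whitespace and lower-casing are assumptions of this sketch rather than part of the definition, and the function name is hypothetical.

```python
def lexical_overlap(text, hypothesis):
    """Proportion of hypothesis words that also occur in the text sentence."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0   # convention chosen here for an empty hypothesis
    hits = sum(1 for word in hyp_words if word in text_words)
    return hits / len(hyp_words)


score = lexical_overlap("The cat sat on the mat", "the cat slept")
```

In this example two of the three hypothesis words occur in the text, so the score is two thirds.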

4.2 Document Projections 

Glickman and Dagan (2005) give a probabilistic definition of entailment in terms of
"possible worlds" which they use to justify their lexical entailment model based on
occurrences of words in web documents. They estimate the lexical entailment probability
$\mathrm{LEP}(u, v)$ to be

$$\mathrm{LEP}(u, v) \approx \frac{n_{u,v}}{n_v}$$

where $n_v$ and $n_{u,v}$ denote the number of documents that the word $v$ occurs in and the
words $u$ and $v$ both occur in respectively. From the context theoretic perspective, we
view the set of documents the word occurs in as its context vector. To describe this
situation in terms of a context theory, consider the vector space $L^\infty(D)$ where $D$ is the
set of documents. With each word $u$ we associate an operator $P_u$ on this vector space by

$$P_u e_d = \begin{cases} e_d & \text{if } u \text{ occurs in document } d \\ 0 & \text{otherwise,} \end{cases}$$

where $e_d$ is the basis element associated with document $d \in D$. $P_u$ is a projection, that
is $P_u P_u = P_u$; it projects onto the space of documents that $u$ occurs in. These
projections are clearly commutative (they are in fact band projections): $P_u P_v = P_v P_u = P_u \wedge P_v$
projects onto the space of documents in which both $u$ and $v$ occur.

In their paper, Glickman and Dagan assume that probabilities can be attached to 
individual words, as we do, although they interpret these as the probability that a 
word is "true" in a possible world. In their interpretation, a document corresponds to a 
possible world, and a word is true in that world if it occurs in the document. 

They do not, however, determine these probabilities directly; instead they make
assumptions about how the entailment probability of a sentence depends on lexical
entailment probability. Although they do not state this, the reason is presumably
data sparseness: they assume that a sentence is true if all its lexical components are true,
which will only happen if all the words occur in the same document. For any sizeable
sentence this is extremely unlikely, hence their alternative approach.

It is nevertheless useful to consider this idea from a context theoretic perspective.
The probability of a term being true can be estimated as the proportion of documents
it occurs in. This is the same as the context theoretic probability defined by the linear
functional $\phi$, which we may think of as determined by a vector $p$ in $L^\infty(D)$ given by
$p(d) = 1/|D|$ for all $d \in D$. In general, for an operator $U$ on $L^\infty(D)$ the context theoretic
probability of $U$ is defined as

$$\phi(U) = \|U^+ p\|_1 - \|U^- p\|_1,$$

where $U^+ = U \vee 0$ and $U^- = (-U) \vee 0$ and the lattice operations are defined by the
Riesz-Kantorovich formula (Example 10). The probability of a term is then $\phi(P_u) =
n_u/|D|$. More generally, the context theoretic representation of an expression $x =
u_1 u_2 \cdots u_m$ is $P_x = P_{u_1} P_{u_2} \cdots P_{u_m}$. This is clearly a semigroup homomorphism (the
representation of $xy$ is the product of the representations of $x$ and $y$), and thus together
with the linear functional $\phi$ defines a context theory for the set of words.

The degree to which $x$ entails $y$ is then given by $\phi(P_x \wedge P_y)/\phi(P_x)$. This corresponds
directly to Glickman and Dagan's entailment "confidence"; it is simply the proportion
of documents that contain all the terms of $x$ which also contain all the terms of $y$.
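The document projection construction can be sketched in a few lines of Python. Everything here is a toy illustration under stated assumptions: documents are sets of words, a projection is represented by the set of document indices it keeps, and the collection `DOCS` and all function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy collection: each document is a set of words.
DOCS = [
    {"dog", "barks", "loud"},
    {"dog", "sleeps"},
    {"cat", "sleeps"},
    {"dog", "barks"},
]


def support(words):
    """Indices of documents containing every word: the range of the projection P_x."""
    return {i for i, doc in enumerate(DOCS) if all(w in doc for w in words)}


def phi(words):
    """phi(P_x) with p(d) = 1/|D|: the proportion of documents containing all of x."""
    return Fraction(len(support(words)), len(DOCS))


def entailment_degree(x, y):
    """phi(P_x ^ P_y) / phi(P_x); P_x ^ P_y keeps the documents containing x and y."""
    return phi(list(x) + list(y)) / phi(x)


def lep(u, v):
    """Lexical entailment probability estimate n_{u,v} / n_v."""
    return Fraction(len(support([u, v])), len(support([v])))


degree = entailment_degree(["dog"], ["barks"])   # 2/3 in this toy collection
```

Both ratios are undefined when the conditioning expression occurs in no document, which is precisely the data sparseness problem for longer expressions.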

4.3 Latent Dirichlet Projections 

The formulation in the previous section suggests an alternative approach to that of 
Glickman and Dagan to cope with the data sparseness problem. We consider the finite 
data available D as a sample from a corpus model D'; the vector p then becomes a 
probability distribution over the documents in $D'$. In our own experiments, we used
latent Dirichlet allocation (Blei, Ng, and Jordan 2003) to build a corpus model based on
a subset of around 380,000 documents from the Gigaword corpus. Having the corpus
model allows us to consider an infinite array of possible documents, and thus we can 
use our context-theoretic definition of entailment since there is no problem of data 
sparseness. 

Latent Dirichlet allocation (LDA) follows the same vein as Latent Semantic Analysis
(LSA) (Deerwester et al. 1990) and Probabilistic Latent Semantic Analysis (PLSA)
(Hofmann 1999) in that it can be used to build models of corpora in which words within
a document are considered to be exchangeable, so that a document is treated as a bag
of words. LSA performs a singular value decomposition on the matrix of words and 
documents which brings out hidden "latent" similarities in meaning between words, 
even though they may not occur together. 

In contrast PLSA and LDA provide probabilistic models of corpora using Bayesian 
methods. LDA differs from PLSA in that, while the latter assumes a fixed number of 
documents, LDA assumes that the data at hand is a sample from an infinite set of 
documents, allowing new documents to be assigned probabilities in a straightforward 
manner. 

Figure 3 shows a graphical representation of the latent Dirichlet allocation generative
model, and Figure 4 shows how the model generates a document of length $N$. In
this model, the probability of occurrence of a word $w$ in a document is considered to be
a multinomial variable conditioned on a $k$-dimensional "topic" variable $z$. The number
of topics $k$ is generally chosen to be much fewer than the number of possible words,
so that topics provide a "bottleneck" through which the latent similarity in meaning
between words becomes exposed.

Figure 3

Graphical representation of the Dirichlet model. The inner box shows the choices that are
repeated for each word in the document; the outer box the choice that is made for each
document; the parameters outside the boxes are constant for the model.

1. Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$
2. For each of the $N$ words:
   (a) Choose $z \sim \mathrm{Multinomial}(\theta)$
   (b) Choose $w$ according to $p(w|z)$

Figure 4

Generative process assumed in the Dirichlet model
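The generative process of Figure 4 can be sketched using only the Python standard library. This is an illustration under stated assumptions, not the implementation used in our experiments: the Dirichlet draw is obtained by normalising independent gamma variates, and the two-topic model, vocabulary and function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Follow the two steps of the generative process for one document."""
    theta = sample_dirichlet(alpha, rng)                        # step 1
    words = []
    for _ in range(n_words):                                    # step 2
        z = rng.choices(range(len(alpha)), weights=theta)[0]    # (a) choose a topic
        w = rng.choices(vocab, weights=beta[z])[0]              # (b) choose a word
        words.append(w)
    return words


# Hypothetical two-topic model over a four-word vocabulary.
vocab = ["goal", "match", "vote", "party"]
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],    # p(w | z = 0)
        [0.0, 0.0, 0.5, 0.5]]    # p(w | z = 1)
doc = generate_document(alpha, beta, vocab, 8, random.Random(0))
```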

The topic variable is assumed to follow a multinomial distribution parameterised
by a $k$-dimensional variable $\theta$, satisfying

$$\sum_{i=1}^{k} \theta_i = 1,$$

and which is in turn assumed to follow a Dirichlet distribution. The Dirichlet
distribution is itself parameterised by a $k$-dimensional vector $\alpha$. The components of this vector
can be viewed as determining the marginal probabilities of topics, since:

$$p(z_i) = \int p(z_i|\theta) \, p(\theta) \, d\theta = \int \theta_i \, p(\theta) \, d\theta.$$

This is just the expected value of $\theta_i$, which is given by

$$E(\theta_i) = \frac{\alpha_i}{\sum_j \alpha_j}.$$

The model is thus entirely specified by $\alpha$ and the conditional probabilities $p(w|z)$,
which we can assume are specified in a $k \times V$ matrix $\beta$ where $V$ is the number of
words in the vocabulary. The parameters $\alpha$ and $\beta$ can be estimated from a corpus
of documents by a variational expectation maximisation algorithm, as described by
Blei, Ng, and Jordan (2003).



Latent Dirichlet allocation was applied by Blei, Ng, and Jordan (2003) to the tasks of
document modelling, document classification and collaborative filtering. They compare
latent Dirichlet allocation to several techniques including probabilistic latent semantic
analysis; latent Dirichlet allocation outperforms these on all of the applications.
Recently latent Dirichlet allocation has been applied to the task of word sense
disambiguation (Cai, Lee, and Teh 2007; Boyd-Graber, Blei, and Zhu 2007) with significant success.



Consider the vector space $L^\infty(A^*)$ for some alphabet $A$, the space of all bounded
functions on possible documents. In this approach, we define the representation of a
string $x$ to be a projection $P_x$ on the subspace representing the (infinite) set of documents
in which all the words in string $x$ occur. Again we define a vector $q$, where
$q(x)$ is the probability of document $x \in A^*$ in the corpus model; we then define a linear
functional $\phi$ for an operator $U$ on $L^\infty(A^*)$ as before by $\phi(U) = \|U^+ q\|_1 - \|U^- q\|_1$. $\phi(P_x)$
is thus the probability that a document chosen at random contains all the words that
occur in string $x$. In order to estimate $\phi(P_x)$ we have to integrate over the Dirichlet
parameter $\theta$:

$$\phi(P_x) = \int_\theta \Big( \prod_{a \in x} p_\theta(a) \Big) p(\theta) \, d\theta$$

where by $a \in x$ we mean that the word $a$ occurs in string $x$, and $p_\theta(a)$ is the probability
of observing word $a$ in a document generated by the parameter $\theta$. We estimate this by

$$p_\theta(a) \simeq 1 - \Big( 1 - \sum_z p(a|z) \, p(z|\theta) \Big)^N$$

where we have assumed a fixed document length $N$. The above formula is an estimate
of the probability of a word occurring at least once in a document of length $N$; the sum
over the topic variable $z$ is the probability that the word $a$ occurs at any one point in
a document given the parameter $\theta$. We approximated the integral using Monte-Carlo
sampling to generate values of $\theta$ according to the Dirichlet distribution.
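The Monte-Carlo estimate described above can be sketched as follows. This is a standard-library illustration rather than the code used in our experiments; it assumes $p(z|\theta) = \theta_z$, represents $p(a|z)$ as one dictionary per topic, and uses hypothetical names and a hypothetical two-topic model throughout.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def word_in_document_prob(word, theta, beta, n_words):
    """Estimate of p_theta(a): 1 - (1 - sum_z p(a|z) p(z|theta))^N."""
    per_position = sum(beta[z][word] * theta[z] for z in range(len(theta)))
    return 1.0 - (1.0 - per_position) ** n_words


def estimate_phi(words, alpha, beta, n_words, samples, rng):
    """Monte-Carlo estimate of phi(P_x): average over theta of the product of p_theta(a)."""
    total = 0.0
    for _ in range(samples):
        theta = sample_dirichlet(alpha, rng)
        prod = 1.0
        for word in sorted(set(words)):      # each distinct word of x counts once
            prod *= word_in_document_prob(word, theta, beta, n_words)
        total += prod
    return total / samples


# Hypothetical two-topic model: beta[z][w] = p(w | z).
alpha = [0.5, 0.5]
beta = [{"goal": 0.5, "match": 0.5, "vote": 0.0, "party": 0.0},
        {"goal": 0.0, "match": 0.0, "vote": 0.5, "party": 0.5}]
estimate = estimate_phi(["goal", "vote"], alpha, beta, 100, 200, random.Random(0))
```

A degree of entailment of $x$ for $y$ could then be approximated as the ratio of the estimate for the words of $x$ and $y$ together to the estimate for the words of $x$ alone.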

Model                    Accuracy    CWS
Dirichlet ($10^6$)       0.584       0.630
Dirichlet ($10^7$)       0.576       0.642
Bayer (MITRE)            0.586       0.617
Glickman (Bar Ilan)      0.586       0.572
Jijkoun (Amsterdam)      0.552       0.559
Newman (Dublin)          0.565       0.6

Table 2

Results obtained with our Latent Dirichlet projection model on the data from the first
Recognising Textual Entailment Challenge for two document lengths $N = 10^6$ and $N = 10^7$
using a cut-off for the degree of entailment of 0.5 at which entailment was regarded as holding.



We built a latent Dirichlet allocation model using Blei, Ng, and Jordan (2003)'s
implementation on documents from the British National Corpus, using 100 topics.
We evaluated this model on the 800 entailment pairs from the first Recognising
Textual Entailment Challenge test set (see footnote 3). Results were comparable to those
obtained by Glickman and Dagan (2005) (see Table 2). In this table, Accuracy is the accuracy
on the test set, consisting of 800 entailment pairs, and CWS is the confidence weighted score;
see Dagan, Glickman, and Magnini (2005) for the definition. The differences between
the accuracy values in the table are not statistically significant because of the small
dataset, although all accuracies in the table are significantly better than chance at the 1%
level. The accuracy of the model is considerably lower than the state of the art, which
is around 75% (Bar-Haim et al. 2006). We experimented with various document lengths
and found very long documents ($N = 10^6$ and $N = 10^7$) to work best.

It is important to note that because the LDA model is commutative, the resulting 
context algebra must also be commutative, which is clearly far from ideal in modelling 
natural language. 



5. The Model of Clark, Coecke and Sadrzadeh 

One of the most sophisticated proposals for a method of composition is that
of Clark, Coecke, and Sadrzadeh (2008) and the more recent implementation of
Grefenstette et al. (2011). In this section, we will show how their model can be described
as a context theory.

The authors describe the syntactic element of their construction using pregroups
(Lambek 2001), a formalism which simplifies the syntactic calculus of Lambek (1958).
These can be described in terms of partially ordered monoids, a monoid $G$ with a
partial ordering $\le$ satisfying $x \le y$ implies $xz \le yz$ and $zx \le zy$ for all $x, y, z \in G$.

3 We have so far only used data from the first challenge, since we performed the experiment before the
other challenges had taken place.

Definition 11 (Pregroup)

Let $G$ be a partially ordered monoid. Then $G$ is called a pregroup if for each $x \in G$ there
are elements $x^l$ and $x^r$ in $G$ such that

(1) $x^l x \le 1$

(2) $x x^r \le 1$

(3) $1 \le x x^l$

(4) $1 \le x^r x$

If $x, y \in G$, we call $y$ a reduction of $x$ if $y$ can be obtained from $x$ using only rules (1) and
(2) above.

Pregroup grammars are defined by freely generating a pregroup on a set of basic 
grammatical types. Words are then represented as elements formed from these basic 
types, for example: 



John      likes              Mary
$\pi$     $\pi^r s o^l$      $o$

where $\pi$, $s$ and $o$ are the basic types for first person singular, statement and object,
respectively. It is easy to see that the above sentence reduces to type $s$ under the
pregroup reductions.
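The reduction of the example sentence can be traced with a small Python sketch. It uses a common encoding that is an assumption of this sketch rather than part of the text: a simple type is a pair (base, k), where k = 0 is the basic type, k = -1 its left adjoint and k = +1 its right adjoint, so that rules (1) and (2) cancel adjacent pairs (base, k)(base, k + 1). The greedy left-to-right strategy finds one reduction and is not a complete parser.

```python
def reduce_type(simple_types):
    """Greedily cancel adjacent pairs (x, k)(x, k + 1), i.e. x^l x -> 1 and x x^r -> 1."""
    stack = []
    for base, k in simple_types:
        if stack and stack[-1] == (base, k - 1):
            stack.pop()                # contraction with the type on the left
        else:
            stack.append((base, k))
    return stack


# John : pi      likes : pi^r s o^l      Mary : o
john = [("pi", 0)]
likes = [("pi", 1), ("s", 0), ("o", -1)]
mary = [("o", 0)]

sentence_type = reduce_type(john + likes + mary)
```

For this sentence the only simple type left after the contractions is the statement type.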

As Clark, Coecke, and Sadrzadeh (2008) note, their construction can be generalised 
by endowing the grammatical type of a word with a vector nature, in addition to its 
semantics. We use this slightly more general construction to allow us to formulate it 
in the context-theoretic framework. We define an elementary meaning space to be the 
tensor product space $V = S \otimes P$ where $S$ is a vector space representing meanings of
words and $P$ is a vector space with an orthonormal basis corresponding to the basic
grammatical types in a pregroup grammar and their adjoints. We assume that meanings
of words live in the tensor algebra space $T(V)$, defined by

$$T(V) = \mathbb{R} \oplus V \oplus (V \otimes V) \oplus (V \otimes V \otimes V) \oplus \cdots$$

For an element $v$ in a particular tensor power of $V$, such that
$v = (s_1 \otimes p_1) \otimes (s_2 \otimes p_2) \otimes \cdots \otimes (s_n \otimes p_n)$, where the $p_i$ are basis vectors of $P$, we can
recover a complex grammatical type for $v$ as the product $\gamma(v) = \gamma_1 \gamma_2 \cdots \gamma_n$, where $\gamma_i$ is the
basic grammatical type corresponding to $p_i$. We will call the vectors such as this which have a
single complex type (i.e. they are not formed from a weighted sum of more than one
type) unambiguous.

We also assume that words are represented by vectors whose grammatical type is
irreducible, i.e. there is no pregroup reduction possible on the type. We define $\Gamma(T(V))$
as the vector space generated by all such vectors.

We will now define a product $\cdot$ on $\Gamma(T(V))$ that will make it an algebra. To do this, it
suffices to define the product between two elements $u_1, u_2$ which are unambiguous and
whose grammatical type is basic, i.e. they can be viewed as elements of $V$. The definition
of the product on the rest of the space follows from the assumption of distributivity. We
define:

$$u_1 \cdot u_2 = \begin{cases} u_1 \otimes u_2 & \text{if } \gamma(u_1 \otimes u_2) \text{ is irreducible} \\ \langle u_1, u_2 \rangle & \text{otherwise.} \end{cases}$$

This product is bilinear, since for a particular pair of basis elements, only one of the 
above two conditions will apply, and both the tensor and inner products are bilinear 
functions. Moreover, it corresponds to composed and reduced word vectors, as defined
in (Clark, Coecke, and Sadrzadeh 2008).

To see how this works on our example sentence above, we assume we have vectors 
for the meanings of the three words, which we write as v wot d. We assume for the purpose 
of this example that the word like is represented as a product state composed of three 
vectors, one for each basic grammatical type. This removes any potentially interesting 
semantics, but allows us to demonstrate the product in a simple manner. We write this 
as follows: 

John likes Mary 

{V]0hn®e v ) ■ (wi ikeM (g) e w r) ■ (viikes,2 ® e s ) ■ (uiikes,3 ® e Q i) ■ (l^Mary ® e Q ) 

where $e_\gamma$ is the orthonormal basis vector corresponding to basic grammatical type
$\gamma$. More interesting representations of like would consist of sums over similar vectors.
Computing this product from left to right:

(wjohn ® e„) ■ (wiikes.l ® e T r). (wi ikeSi 2 ® e s ) ■ («likes : 3 ® € l) ■ {v M ary ® e G ) 

= (vjobn, Wlikes.l) ( w likes,2 <E> e s ) ■ («likes : 3 <8> e l) ■ («Mary ® e G ) 

= («John, «likes,l) ( w likes,2 ® e s ) <g) (wiikes : 3 ® e„l) ■ (^Mary ® e G ) 

= (^ohn,^toes,l)( 1; Ukes,3 ! 'yMary) (viikes,2 ® e s ) 

As we would expect in this simplified example the product is a scalar multiple of the 
second vector for like, with the type of a statement. 
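
The left-to-right computation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction used in the paper: the numeric vectors are made up, the names `word`, `product` and `reduces` are hypothetical, basic types are encoded as a pair (base, adjoint), and a reduction is only attempted where the two operands meet.

```python
from functools import reduce

import numpy as np

# Assumed encoding: a basic type is (base, adjoint), where adjoint is
# -1 for a left adjoint (n^l), 0 for a plain type (n, s), +1 for a right adjoint (n^r).
# An unambiguous element is (scalar, [(vector, type), ...]).


def reduces(left_type, right_type):
    """Pregroup contraction x^(k) x^(k+1) -> 1, e.g. n n^r -> 1 and n^l n -> 1."""
    (base_l, adj_l), (base_r, adj_r) = left_type, right_type
    return base_l == base_r and adj_r == adj_l + 1


def word(vec, base, adjoint=0):
    """Build an element with a single factor of basic type (hypothetical helper)."""
    return 1.0, [(np.asarray(vec, dtype=float), (base, adjoint))]


def product(x, y):
    """Tensor the factors together, taking inner products wherever the
    adjacent types at the junction of x and y contract."""
    scalar = x[0] * y[0]
    left, right = list(x[1]), list(y[1])
    while left and right and reduces(left[-1][1], right[0][1]):
        u, _ = left.pop()
        v, _ = right.pop(0)
        scalar *= float(np.dot(u, v))
    return scalar, left + right


# Made-up two-dimensional vectors for the three words.
john = word([1, 2], "n")
likes = [word([3, 1], "n", 1), word([0.5, 0.5], "s"), word([2, 0], "n", -1)]
mary = word([1, 4], "n")

# Multiply from left to right, as in the derivation.
scalar, factors = reduce(product, [john, *likes, mary])
print(scalar, factors)
```

With these made-up vectors the two inner products are 5 and 2, so the result is the scalar 10 times the second vector for like, carrying the statement type s.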

This construction thus allows us to represent complex grammatical types, similar to
Clark, Coecke, and Sadrzadeh (2008); however, it also allows us to take weighted sums
of these complex types, giving us a powerful method of expressing the syntactic and
semantic ambiguity of lexical semantics.
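
As a small illustration of how such a weighted sum behaves under the product, consider a hypothetical word $w$ that is ambiguous between a noun reading and an intransitive-verb reading. The vectors $a$, $b$, $c$ and the weights $\alpha$, $\beta$ are invented for this sketch, and the basic types are assumed to be written $n$ and $s$:

\[
w = \alpha \, (a \otimes e_{n}) + \beta \, (b \otimes e_{n^r}) \cdot (c \otimes e_{s})
\]

By distributivity and associativity, multiplying on the left by a noun gives

\[
(v_{\text{John}} \otimes e_{n}) \cdot w
= \alpha \, (v_{\text{John}} \otimes e_{n}) \otimes (a \otimes e_{n})
+ \beta \, \langle v_{\text{John}}, b \rangle \, (c \otimes e_{s}),
\]

since the type $n\,n$ is irreducible while $n\,n^r$ reduces. The result is itself a weighted sum of two unambiguous vectors, one of type $n\,n$ and one of type $s$.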

6. Conclusions and Future Work 

We have presented a context-theoretic framework for natural language semantics. The 
framework is founded on the idea that meaning in natural language can be determined 
by context, and is inspired by techniques that make use of statistical properties of 
language by analysing large text corpora. Such techniques can generally be viewed as 
representing language in terms of vectors. These techniques are currently used in
applications such as textual entailment recognition; however, the lack of a theory of
meaning that incorporates these techniques means that they are often used in a somewhat
ad hoc manner. The purpose behind the framework is to provide a unified theoretical
foundation for such techniques so that they may be used in a principled manner.

By formalising the notion of "meaning as context" we have been able to build a 
mathematical model that informs us about the nature of meaning under this paradigm. 
Specifically, it gives us a theory about how to represent words and phrases using 
vectors, and tells us that the product of two meanings should be distributive and 
associative. It also gives us an interpretation of the inherent lattice structure on these 
vector spaces as defining the relation of entailment. It also tells us how to measure the 
size of the vector representation of a string in such a way that the size corresponds to 
the probability of the string. 

We have demonstrated that the framework encompasses several related approaches
to compositional distributional semantics, including those based on a predefined
composition operation such as addition (Mitchell and Lapata 2008; Landauer and
Dumais 1997; Foltz, Kintsch, and Landauer 1998) or the tensor product (Smolensky
1990; Clark and Pulman 2007; Widdows 2008), matrix multiplication (Rudolph and
Giesbrecht 2010), and the more sophisticated construction of Clark, Coecke, and
Sadrzadeh (2008).



6.1 Practical Investigations 

The section on textual entailment raises many possibilities for the design of systems to
recognise textual entailment within the framework:



Variations on substring matching: experiments with different weighting 
schemes for substrings, allowing partial commutativity of words or 
phrases, and replacing words with vectors representing their context, 
using tensor products of these vectors instead of concatenation. 

Extensions of Glickman and Dagan's approach and our own 
context-theoretic approach using latent Dirichlet allocation, perhaps using 
other corpus models based on n-grams or other models in which words do 
not commute, or a combination of context theories based on commutative 
and non-commutative models. 

The LDA model we used is a commutative one. This is a considerable 
simplification of what is possible within the context-theoretic framework; 
it would be interesting to investigate methods of incorporating 
non-commutativity into the model. 

Implementations based on approaches to representing uncertainty in
logical semantics similar to those described in Clarke (2007).



All of these ideas could be evaluated using the data sets from the Recognising Textual 
Entailment Challenges. 

There are many approaches to textual entailment that we have not considered here; 
we conjecture that variations of many of them could be described within our frame- 
work. We leave the task of investigating the relationship between these approaches and 
our framework to further work. 

Another area that we are investigating, together with researchers at the University
of Sussex, is the possibility of learning finite-dimensional algebras directly from corpus
data, along the lines of Guevara (2011) and Baroni and Zamparelli (2010).

One question we have not addressed in this paper is the feasibility of computing 
with algebraic representations. Although this question is highly dependent on the 
particular context theory chosen, it is possible that general algorithms for computation 
within this framework could be found; this is another area that we intend to address in 
further work. 

6.2 Theoretical Investigations 



Although the context-theoretic framework is an abstraction of the model of meaning as 
context, it would be good to have a complete understanding of the model and the types 
of context theories that it allows. Tying down these properties would allow us to define 
algebras that could truly be called "context theories". 

The context-theoretic framework shares a lot of properties with the study of free 
probability (Voiculescu 1997). It would be interesting to investigate whether ideas from 
free probability would carry over to context-theoretic semantics. 

Although we have related our model to many techniques described in the literature,
we still have to investigate its relationship with other models such as that of
Song and Bruza (2003) and Guevara (2011).



We have not given much consideration here to the issue of multi-word expres- 
sions and non-compositionality. What predictions does the context-theoretic framework 
make about non-compositionality? Answering this may lead us to new techniques for 
recognising and handling multi-word expressions and non-compositionality. 

Of course it is hard to predict the benefits that may result from what we have 
presented, since we have given a way of thinking about meaning in natural language 
that in many respects is new. This new way of thinking opens the door to the unification 
of logic-based and vector-based methods in computational linguistics, and the potential 
fruits of this union are many. 